13. Pandas groupby Usage

Group-wise Aggregation: A Core Pattern of Data Analysis

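Split-Apply-Combine is one of the most important operating patterns in data analysis:

Split: divide the data into groups according to some criterion
Apply: apply a function to each group independently
Combine: merge the per-group results into a new data structure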
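The three steps can be spelled out by hand to see what a `groupby` call does. The following is a minimal sketch, assuming a small made-up DataFrame (the `sector` and `ret` columns and their values are hypothetical, not course data); it compares a manual loop with the equivalent one-line call.

```python
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'sector': ['tech', 'bank', 'tech', 'bank'],
    'ret': [0.02, 0.01, 0.04, 0.03],
})

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
# Apply: compute the mean return of each sub-DataFrame
pieces = {key: group['ret'].mean() for key, group in df_demo.groupby('sector')}

# Combine: assemble the per-group results into one Series
manual = pd.Series(pieces, name='ret').sort_index()

# The same three steps in a single call
auto = df_demo.groupby('sector')['ret'].mean()

print(manual)
print(auto)
```

Both objects hold one value per sector; the one-line form is what the rest of this section uses.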

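Business Applications of Group-wise Aggregation

Summarizing stock performance by industry
Computing returns by time period
Analyzing risk by market or region
Totaling sales by customer type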

⭐ groupby Basics

Listing 1

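# ⚠️ Original platform code: enter it into the teaching platform exactly as shown (comments excepted) so that the platform marks the answer as correct
import pandas as pd  # import the pandas data analysis library
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',  # create the DataFrame df; column A holds the group labels
             'foo', 'bar'],  # column A, continued
          'B' : ['one', 'one', 'two', 'three',  # column B: a second label column
             'two', 'two'],  # column B, continued
          'C' : [1, 5, 5, 2, 5, 5],  # column C: integer values
          'D' : [2.0, 5., 8., 1., 2., 9.]})  # column D: float values
grouped = df.groupby('A')[['C', 'D']]  # group by column A and select columns C and D
result=grouped.transform(lambda x: (x - x.mean()) / x.std())  # standardize C and D within each group (z-score)
print(result)  # print the standardized values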

          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000

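Code Walkthrough: The Core Logic of groupby

df.groupby('A'): splits the data into groups by the values in column A
[['C', 'D']]: selects columns C and D for the operations that follow
transform(): applies a function to each group and returns a result aligned row for row with the original data
lambda x: (x - x.mean()) / x.std(): the standardization (z-score) formula, applied per column within each group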
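To make the formula concrete, the z-scores of one group can be recomputed by hand. This is a small sketch using only the C values of the `foo` group from Listing 1 (rows 0, 2, 4); note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which is why the divisor is n - 1 rather than n.

```python
import pandas as pd

# Column C of group 'foo' in Listing 1 (rows 0, 2 and 4)
foo_c = pd.Series([1, 5, 5])

mean = foo_c.mean()  # 11 / 3
std = foo_c.std()    # sample standard deviation (ddof=1 by default)

# Same formula as the lambda passed to transform()
z = (foo_c - mean) / std
print(z.round(6).tolist())  # [-1.154701, 0.57735, 0.57735]
```

These values match rows 0, 2 and 4 of column C in the Listing 1 output.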

Aggregation Functions: Mean and Sum

Listing 2

# Group by A and select specific columns
grouped = df.groupby('A')[['C', 'D']]

# Compute the mean of each group
print('Mean:')
print(grouped.mean())

# Compute the sum of each group
print('\nSum:')
print(grouped.sum())

Mean:
            C    D
A                 
bar  4.000000  5.0
foo  3.666667  4.0

Sum:
      C     D
A            
bar  12  15.0
foo  11  12.0

Aggregation Functions: Count and Standard Deviation

Listing 3

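# Count the non-missing values in each group
print('Count:')
print(grouped.count())

# Compute the standard deviation of each group (a measure of dispersion)
print('\nStandard deviation:')
print(grouped.std())

Count:
     C  D
A        
bar  3  3
foo  3  3

Standard deviation:
            C         D
A                      
bar  1.732051  4.000000
foo  2.309401  3.464102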
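Because `count()` counts only non-missing values, it can differ from the number of rows in a group. The sketch below illustrates this with a hypothetical frame `df_nan` that contains one `NaN` (the course data has no missing values, so the two would agree there), and contrasts `count()` with `size()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group 'foo'
df_nan = pd.DataFrame({'A': ['foo', 'foo', 'bar'],
                       'C': [1.0, np.nan, 3.0]})

non_missing = df_nan.groupby('A')['C'].count()  # non-missing values per group
n_rows = df_nan.groupby('A').size()             # total rows per group

print(non_missing)
print(n_rows)
```

Here `foo` has two rows but only one non-missing value, so `count()` reports 1 and `size()` reports 2.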

Applying Several Aggregation Functions at Once: agg()

Listing 4

# agg() accepts a list of functions and computes several statistics in one call
print(grouped.agg(['mean', 'std', 'min', 'max']))

            C                      D                    
         mean       std min max mean       std  min  max
A                                                       
bar  4.000000  1.732051   2   5  5.0  4.000000  1.0  9.0
foo  3.666667  2.309401   1   5  4.0  3.464102  2.0  8.0

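agg() is the most flexible of the aggregation methods
Passing a list of functions computes the mean, standard deviation, minimum, and maximum in a single call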
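Beyond a plain list, `agg()` also accepts a per-column mapping and named aggregations. The following is a brief sketch on the C and D columns of the same data; the output names `c_total` and `d_range` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})

# Different functions per column via a dict of lists
per_column = df.groupby('A').agg({'C': ['sum'], 'D': ['min', 'max']})

# Named aggregation: output_name=(input_column, function)
named = df.groupby('A').agg(
    c_total=('C', 'sum'),
    d_range=('D', lambda s: s.max() - s.min()),
)

print(per_column)
print(named)
```

Named aggregation yields flat, self-describing column names, which is often easier to work with downstream than the two-level header produced by a list of functions.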

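transform: Group-wise Transformation

transform returns a result with the same shape as the original data, which suits tasks such as within-group standardization.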

Listing 5

# Standardization formula: (x - mean) / std
result = grouped.transform(lambda x: (x - x.mean()) / x.std())
print('Standardized result:')
print(result)

Standardized result:
          C         D
0 -1.154701 -0.577350
1  0.577350  0.000000
2  0.577350  1.154701
3 -1.154701 -1.000000
4  0.577350 -0.577350
5  0.577350  1.000000

Verifying the Standardized Result

Listing 6

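# Check: each group's mean should be approximately 0 (up to floating-point error)
print('Check - group means:')
print(result.groupby(df['A']).mean())

# Check: each group's standard deviation should be approximately 1
print('\nCheck - group standard deviations:')
print(result.groupby(df['A']).std())

Check - group means:
                C    D
A                     
bar  0.000000e+00  0.0
foo  7.401487e-17  0.0

Check - group standard deviations:
       C    D
A            
bar  1.0  1.0
foo  1.0  1.0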

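Financial application: within-group standardization removes differences in price scale across stocks, which makes cross-sectional comparison easier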
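As an illustration of the scale problem, consider two stocks whose prices differ by a factor of 100. This is a minimal sketch with made-up tickers (`AAA`, `BBB`) and made-up closing prices; after standardizing within each ticker, both series sit on the same unitless scale.

```python
import pandas as pd

# Hypothetical closing prices; tickers and values are invented for illustration
prices = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'BBB'],
    'close': [10.0, 11.0, 12.0, 1000.0, 1100.0, 1200.0],
})

# Z-score within each ticker so both series share a unitless scale
prices['close_z'] = prices.groupby('ticker')['close'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(prices)
```

Both tickers end up with z-scores of -1, 0 and 1, even though their raw prices are two orders of magnitude apart.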

filter: Filtering Groups

filter() keeps or drops entire groups based on a group-level statistic (it does not filter individual rows).

Listing 7

# Keep every group whose mean of column C is > 3
filtered = df.groupby('A').filter(lambda x: x['C'].mean() > 3)
print('Filtered data:')
print(filtered)

Filtered data:
     A      B  C    D
0  foo    one  1  2.0
1  bar    one  5  5.0
2  foo    two  5  8.0
3  bar  three  2  1.0
4  foo    two  5  2.0
5  bar    two  5  9.0

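Logic: if a group's mean of column C is > 3, all rows of that group are kept. Here both groups qualify (bar: 4.0, foo: about 3.67), so the output equals the original data
Financial applications: keeping only actively traded stocks, or dropping small customer segments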
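To see a group actually being dropped, the threshold can be raised between the two group means. The sketch below uses the same A and C columns with an arbitrarily chosen threshold of 3.8, so that only `bar` (mean 4.0) passes and `foo` (mean about 3.67) is removed as a whole.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'C': [1, 5, 5, 2, 5, 5]})

# Group means of C: bar = 4.0, foo is about 3.67; only bar exceeds 3.8
kept = df.groupby('A').filter(lambda g: g['C'].mean() > 3.8)
print(kept)
```

Rows 1, 3 and 5 remain, including row 3 whose own C value (2) is below the threshold: the decision is made per group, not per row.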

apply: Flexible Custom Functions

apply() can run an arbitrarily complex custom function on each group.

Listing 8

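# Custom function: find the row with the largest D in each group
def get_max_row(group):
    """Return the row of the group with the largest value in column D."""
    return group.loc[group['D'].idxmax()]

result = df.groupby('A').apply(get_max_row)
print('Row with the largest D in each group:')
print(result)

Row with the largest D in each group:
       A    B  C    D
A                    
bar  bar  two  5  9.0
foo  foo  two  5  8.0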
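The same rows can also be retrieved without `apply()`, by asking each group for the label of its largest D and then indexing the original frame. This is offered as an alternative sketch rather than a replacement. One reason it can be useful: depending on the pandas version, `groupby(...).apply()` may warn about, or exclude, the grouping column when the function receives it, whereas the approach below does not rely on that behavior.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three', 'two', 'two'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})

# idxmax() per group returns the row label of the largest D in each group
top_idx = df.groupby('A')['D'].idxmax()

# Use those labels to pull the full rows from the original frame
top_rows = df.loc[top_idx]
print(top_rows)
```

The result keeps the original row labels (5 for `bar`, 2 for `foo`) instead of being indexed by group key.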

Summary of the Four groupby Operations

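Method	Purpose	Output shape
`agg()`	Aggregate statistics (mean, sum, etc.)	One row per group
`transform()`	Within-group transformation (standardization, etc.)	Same as the original data
`filter()`	Keep or drop entire groups by a condition	Subset of the original data
`apply()`	Apply an arbitrary function	Depends on the function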
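The output shapes in the table can be checked directly. The sketch below runs all four operations on the A, C and D columns of the sample data and prints the shape of each result; the particular lambdas are arbitrary choices made only to produce an output of each kind.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'C': [1, 5, 5, 2, 5, 5],
                   'D': [2.0, 5., 8., 1., 2., 9.]})
g = df.groupby('A')[['C', 'D']]

agg_out = g.agg('mean')                                     # one row per group
trans_out = g.transform(lambda x: x - x.mean())             # one row per original row
filt_out = df.groupby('A').filter(lambda x: len(x) > 2)     # subset of the original rows
apply_out = g.apply(lambda x: x['D'].max() - x['C'].min())  # here: one scalar per group

print(agg_out.shape, trans_out.shape, filt_out.shape, apply_out.shape)
# (2, 2) (6, 2) (6, 3) (2,)
```

With a different function, `apply()` could just as well return a DataFrame, which is why its shape is listed as depending on the function.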